Lab Prompt

“You are a scout for the worst team in the NBA, probably the Wizards. Your general manager just heard about Data Science and thinks it can solve all the team’s problems! She wants you to figure out a way to find players that are high performing but maybe not highly paid that you can steal to get the team to the playoffs!”

Summary of Approach to Recommending Players for Acquisition

This site showcases my data-driven approach for recommending players for acquisition by an NBA team. I use a two-stage approach that combines unsupervised machine learning (clustering) with a supervised machine learning regression model to make educated predictions about which high-performing players are underpaid and thus ideal targets for acquisition.

  1. Conduct unsupervised machine learning with k-means clustering. This will take all relevant features into account and produce another feature, cluster, which will eventually help produce a more accurate supervised machine learning regression model. To decide the ideal number of clusters for this dataset, I will use a function that evaluates explained variance over a range of cluster counts in order to reveal which number of clusters maximizes explained variance while minimizing complexity.

  2. With clustering complete, I will produce a 3D visualization of the stats most closely correlated with salary in order to reveal the high-performing players that are underpaid relative to their peers. To identify the features that are most correlated with salary, I will first develop a correlogram of all the relevant features in the dataset.

  3. I will then develop a supervised machine learning regression model to predict what a player should be earning given their performance stats. I will evaluate different regression models, such as rpart2 decision tree regression and a generalized linear model, to see which produces the most accurate predictions. Equipped with salary predictions, I will investigate the players who stood out in the 3D visualization and see whether the model predicts that these players are underpaid.

Data

I will be using a dataset of 401 NBA players throughout the 2020-2021 season that includes the following information and stats:

  • Player
  • Position
  • Age
  • Tm (Team)
  • G (Number of Games Played in)
  • GS (Number of Games Player Has Started in)
  • MP (Minutes Played)
  • FG (Field Goals)
  • FGA (Field Goals Attempted)
  • FG. (Field Goal Percentage)
  • X3P (Three point baskets)
  • X3PA (Three point shot attempts)
  • X3P. (Three point shot percentage)
  • X2P (Two point baskets)
  • X2PA (Two point shot attempts)
  • X2P. (Two point shot percentage)
  • eFG. (effective field goal percentage)
  • FT (Free Throws)
  • FTA (Free Throws Attempted)
  • FT. (Free Throw Percentage)
  • ORB (Offensive Rebounds)
  • DRB (Defensive Rebounds)
  • TRB (Total Rebounds)
  • AST (Assists)
  • STL (Steals)
  • BLK (Blocks)
  • TOV (Turnovers)
  • PF (Personal Fouls)
  • PTS (Points)
  • 2020-2021 (Player’s Salary in 2020-2021 Season)

Data Preparation and Variable Selection

For this dataset, I removed players who had incomplete stat reports. Removing players with NA values took 36 players out of consideration. Since 401 of the original 437 players (about 92%) remained to inform the models and visualizations and to be considered as candidates for acquisition, dropping the incomplete records did not meaningfully reduce the usefulness of the dataset.

As I selected the variables to be considered in the models, I removed variables that would not provide value to the model or could not be processed, such as the name of the players (Player) and the name of the teams (Tm). Columns holding raw shooting counts (made shots and attempted shots) were removed because the shooting percentage stats capture that data. At this point, I normalized the data for clustering to mitigate the risk of variables with larger scales overshadowing variables with smaller scales. For the initial clustering model, position also had to be removed, as the k-means clustering approach that I employed cannot process categorical data.
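To illustrate why normalization matters before clustering, here is a minimal sketch of min-max scaling. The write-up does not say which normalization was used, so min-max is an assumption, and the function name and toy values below are hypothetical rather than taken from the dataset.

```python
def min_max_normalize(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


# Hypothetical toy values (not from the dataset): minutes played are in the
# thousands while free throw percentage lies between 0 and 1.
minutes_played = [500.0, 1500.0, 2500.0]
free_throw_pct = [0.60, 0.75, 0.90]

scaled_minutes = min_max_normalize(minutes_played)
scaled_ft_pct = min_max_normalize(free_throw_pct)

print(scaled_minutes)
print(scaled_ft_pct)
```

After scaling, both columns span the same range, so a distance-based method such as k-means no longer treats a difference of a few hundred minutes as vastly more important than a large difference in shooting percentage.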

Unsupervised Machine Learning: K-means Clustering

With the data cleaned and prepared, the first thing that I did was fit a k-means clustering model. Based on the features (variables) in consideration, k-means clustering assigns each player to a cluster so that similar players are grouped together. This provides value for the supervised machine learning step, because the cluster each player is assigned to can be used as a new feature that may be associated with salary and help the model make more accurate predictions.

Using this elbow plot, I visualized the explained variance produced by running k-means clustering on this data with different k values (number of clusters). The elbow of the plot, the point past which adding clusters yields little additional explained variance, occurs at k = 3, so I used 3 clusters for this dataset.
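The elbow idea can be sketched on toy data. The code below is not the function used in this lab. It is a small pure-Python k-means with a deterministic farthest-point start (an assumption made so the example is reproducible), run on three hypothetical, well-separated groups of points. Explained variance is computed as between-cluster sum of squares divided by total sum of squares.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Move every center to the mean of its members (keep it if empty).
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(mean_point(members) if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return centers, labels


def explained_variance(points, k):
    """Share of total variance captured by k clusters (between SS / total SS)."""
    centers, labels = kmeans(points, k)
    grand_mean = mean_point(points)
    total_ss = sum(sq_dist(p, grand_mean) for p in points)
    within_ss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return (total_ss - within_ss) / total_ss


# Hypothetical toy data: three tight groups of four points each.
toy_points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 10), (5, 11), (6, 10), (6, 11),
]

curve = {k: explained_variance(toy_points, k) for k in range(1, 7)}
for k, value in curve.items():
    print(k, round(value, 3))
```

On this toy data the curve climbs steeply up to k = 3 and is nearly flat afterwards, which is the shape the elbow plot is meant to reveal.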

Using Correlogram to Identify Best Predictors of a Player’s Salary

At this point, I also created a correlogram which shows which of a player’s stats are most correlated with their salary.

This correlogram suggested that Assists (AST), Points (PTS), and Turnovers (TOV) were the three variables most correlated with a player’s salary. As a result, these were the three variables that I selected to plot players and their salaries in a 3D visualization.
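As a rough illustration of what the correlogram summarizes, the sketch below computes the Pearson correlation between each stat and salary and ranks the stats by absolute correlation. The numbers are hypothetical toy values chosen for readability, not rows from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical toy values for five players (salary in millions of dollars).
stats = {
    "PTS": [5, 10, 15, 20, 25],
    "AST": [1, 2, 2, 4, 6],
    "PF": [3, 1, 2, 3, 1],
}
salary = [1.0, 2.0, 3.0, 4.0, 5.0]

correlations = {name: pearson(values, salary) for name, values in stats.items()}
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)

for name in ranked:
    print(name, round(correlations[name], 3))
```

Ranking by absolute value keeps strongly negative relationships in view as well, since a stat that falls as salary rises is just as informative as one that rises with it.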

3D Visualization of Assists, Points, Turnovers and Salary

As these three variables are the best individual predictors of a player’s salary, I graphed them expecting the players with the best stats across these variables to be the ones earning the highest salaries. However, I also expected to find players who are high-performing across these three crucial stats but compensated significantly less than players of similar caliber, and these would be targets for acquisition that deserve further consideration. To show the discrepancies between players’ salaries, I mapped each player’s salary to the size of their plotted point. A player with a small circle amongst players with much larger circles is therefore a player who is paid significantly less than other players of comparable caliber.

This visualization drew my attention to several players, specifically Trae Young, Donovan Mitchell, De’Aaron Fox, Bam Adebayo, Shai Gilgeous-Alexander, and LaMelo Ball.

Developing Supervised Machine Learning Regression Model to Predict Salary

After examining this visualization and equipped with cluster data from my initial k-means clustering, I then implemented a supervised machine learning regression approach to further examine the relationship between performance and compensation in order to predict who would be the most cost-efficient players to acquire. This model would consider all of their stats and the cluster that they were assigned to in the earlier k-means clustering model.

Evaluating the performance metrics RMSE, Rsquared, and MAE across values of the hyperparameter maxdepth in order to identify the maxdepth level that maximizes performance while minimizing complexity

## CART 
## 
## 281 samples
##  28 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 253, 253, 253, 253, 252, 253, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  RMSE     Rsquared   MAE    
##    1        5107099  0.7235741  4109691
##    2        3140121  0.9075477  2416539
##    3        3203877  0.9056782  2444173
##    4        3203877  0.9056782  2444173
##    5        3203877  0.9056782  2444173
##    6        3203877  0.9056782  2444173
##    7        3203877  0.9056782  2444173
##    8        3203877  0.9056782  2444173
##    9        3203877  0.9056782  2444173
##   10        3203877  0.9056782  2444173
##   11        3203877  0.9056782  2444173
##   12        3203877  0.9056782  2444173
##   13        3203877  0.9056782  2444173
##   14        3203877  0.9056782  2444173
##   15        3203877  0.9056782  2444173
##   16        3203877  0.9056782  2444173
##   17        3203877  0.9056782  2444173
##   18        3203877  0.9056782  2444173
##   19        3203877  0.9056782  2444173
##   20        3203877  0.9056782  2444173
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 2.
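The resampling summary above (10 folds, repeated 5 times, training sizes of 253 or 252) can be reproduced with a short sketch. This is not the library's own fold-generation procedure, which may stratify on the outcome. It is a plain shuffled split, with the function name and seed chosen for illustration. The RMSE values are the first three rows of the table above.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=5, seed=0):
    """Yield (train_indices, held_out_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        for fold in range(n_folds):
            # Every n_folds-th shuffled index goes into this fold's hold-out set.
            held_out = indices[fold::n_folds]
            held_out_set = set(held_out)
            train = [i for i in indices if i not in held_out_set]
            yield train, held_out


splits = list(repeated_kfold(281))
train_sizes = sorted({len(train) for train, _ in splits})
print(len(splits), train_sizes)

# Cross-validated RMSE by maxdepth, copied from the first rows of the table.
cv_rmse = {1: 5107099, 2: 3140121, 3: 3203877}
best_maxdepth = min(cv_rmse, key=cv_rmse.get)
print(best_maxdepth)
```

With 281 samples, nine folds hold out 28 players and one holds out 29, which is why the training sizes alternate between 253 and 252. Picking the smallest RMSE selects maxdepth = 2, matching the final line of the output.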

Building a generalized linear model to see if it provides better performance than the rpart2 model

## Generalized Linear Model 
## 
## 281 samples
##  28 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 253, 253, 253, 253, 252, 253, ... 
## Resampling results:
## 
##   RMSE     Rsquared   MAE    
##   2808440  0.9233609  2105813

The Generalized Linear Model (glm) produces lower RMSE (2,808,440 vs. 3,140,121) and MAE (2,105,813 vs. 2,416,539) values and a higher Rsquared value (0.923 vs. 0.908) than the best rpart2 model, which means that the glm is a more accurate predictor of salary under cross-validation.

Evaluating variable importance of the generalized linear model

## glm variable importance
## 
##   only 20 most important variables shown (out of 30)
## 
##          Overall
## cluster2 100.000
## cluster3  76.643
## Age       22.578
## G         17.290
## DRB       10.618
## AST       10.457
## BLK        9.853
## ORB        9.103
## FGA        7.339
## X3P        7.237
## FG         7.201
## X3PA       6.412
## PosPF      4.083
## TOV        3.951
## PosPG-SG   3.713
## PF         2.949
## FT.        2.614
## STL        2.415
## PosSG      1.882
## GS         1.667
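The cluster2 and cluster3 rows suggest that the cluster label entered the model as a categorical variable expanded into indicator columns, with cluster 1 as the reference level. That reading is an inference from the output, not something stated in this write-up. A minimal sketch of that kind of encoding, with a hypothetical helper name and toy labels:

```python
def dummy_encode(labels, reference):
    """Expand a categorical cluster label into 0/1 indicator columns.

    The reference level gets no column of its own; a row of all zeros
    means the player belongs to the reference cluster.
    """
    levels = sorted(set(labels) - {reference})
    return [
        {f"cluster{level}": int(label == level) for level in levels}
        for label in labels
    ]


# Hypothetical cluster assignments for four players.
rows = dummy_encode([1, 2, 3, 2], reference=1)
for row in rows:
    print(row)
```

Under this encoding, each indicator's coefficient measures the salary difference between that cluster and the reference cluster, holding the other predictors fixed.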

Evaluating Supervised Machine Learning Regression Model Performance and Salary Predictions

Performance Metrics From Generalized Linear Model

##         RMSE     Rsquared          MAE 
## 2.535178e+06 9.337872e-01 1.881484e+06

I developed a few machine learning regression models, and I ultimately chose to proceed with a generalized linear regression model. This model produced the following metrics:

  • RMSE = 2,535,178. RMSE is the square root of the average squared error, where error is the difference between a player’s actual salary and the salary that our model predicted. A typical prediction therefore misses by roughly $2.5 million, with large misses counting more heavily than small ones.
  • Rsquared = 0.934. This Rsquared value means that about 93.4% of the variance in salary can be explained by the independent variables considered. As a perfect Rsquared value is 1.00, the model is performing well in predicting salaries.
  • MAE = 1,881,484. MAE (mean absolute error) is the average absolute difference between actual and predicted salary, so by this measure the model is off by $1,881,484 on average. Both metrics summarize the model’s error, but MAE weights every error equally, whereas RMSE squares each error before averaging and so places more emphasis on larger errors.
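To make the difference between these metrics concrete, the sketch below computes all three on hypothetical toy salaries (in millions). R-squared is computed here as 1 minus the ratio of residual to total sum of squares, which is one common definition. The library that produced the output above may use a different one, such as the squared correlation between predictions and observations.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Return (RMSE, MAE, R-squared) for paired actual and predicted values."""
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    ss_res = sum(e * e for e in errors)
    rmse = sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r_squared = 1 - ss_res / ss_tot
    return rmse, mae, r_squared


# Hypothetical toy salaries in millions of dollars.
actual = [2.0, 4.0, 6.0, 8.0]
even_misses = [3.0, 5.0, 7.0, 9.0]    # every prediction is off by 1
one_big_miss = [2.0, 4.0, 6.0, 12.0]  # three exact, one off by 4

even = regression_metrics(actual, even_misses)
skewed = regression_metrics(actual, one_big_miss)
print(even)
print(skewed)
```

Both sets of predictions have the same MAE of 1, but the set with a single large miss has twice the RMSE, which is the sense in which RMSE emphasizes larger errors.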

With these metrics showing that the model is performing well, I then used the model to make predictions on what a player’s salary should be based on their stats. By subtracting a player’s actual salary from their predicted salary, I developed a column (pred_vs_obs_residual) that could then be filtered on to identify the players who are the most underpaid according to the model.

Final Analysis

From the 3D visualization, I became interested in Trae Young, Donovan Mitchell, De’Aaron Fox, Bam Adebayo, Shai Gilgeous-Alexander, and LaMelo Ball, as they had high performance markers and appeared to be significantly underpaid relative to their peers. With interest in these players established, I then looked at the salary predictions from my supervised ML regression model to see which ones would be the most cost-effective to acquire.

  • Trae Young: Predicted - Actual Salary = $4,151,926
  • Donovan Mitchell: Predicted - Actual Salary = $2,897,025
  • De’Aaron Fox: Predicted - Actual Salary = $1,135,911
  • Bam Adebayo: Predicted - Actual Salary = $2,083,290
  • Shai Gilgeous-Alexander: Predicted - Actual Salary = $3,169,741
  • LaMelo Ball: Predicted - Actual Salary = -$193,005
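The differences listed above can be reproduced for the three players whose predicted and actual salaries are both reported in this write-up. The sketch below builds the residual (predicted minus actual) and sorts by it, largest first. The dictionary layout and variable names are illustrative and do not reflect the original data frame.

```python
# Predicted and actual 2020-2021 salaries as reported in this write-up.
salaries = {
    "Trae Young": {"predicted": 10_723_726, "actual": 6_571_800},
    "Donovan Mitchell": {"predicted": 8_092_526, "actual": 5_195_501},
    "Shai Gilgeous-Alexander": {"predicted": 7_311_061, "actual": 4_141_320},
}

# A positive residual means the model thinks the player should earn more
# than they are currently paid.
residuals = {
    player: row["predicted"] - row["actual"] for player, row in salaries.items()
}

most_underpaid_first = sorted(residuals, key=residuals.get, reverse=True)
for player in most_underpaid_first:
    print(f"{player}: ${residuals[player]:,}")
```

A negative residual, as with LaMelo Ball in the list above, would indicate that the model predicts a salary below what the player is actually paid.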

Players to target for acquisition: Trae Young, Donovan Mitchell, and Shai Gilgeous-Alexander

Trae Young

Our generalized linear model predicted that Trae Young would earn $10,723,726, but during the 2020-2021 season he was paid only $6,571,800. The model’s average error is approximately $2.5 million by RMSE or $1.9 million by MAE. Even if the model over-predicted Trae Young’s salary by the full RMSE, the adjusted prediction of about $8.19 million would still exceed his actual salary by roughly $1.6 million. Amongst the players in consideration, I am the most confident that Trae Young is being underpaid, so acquiring him is a great opportunity to gain a high-caliber player for less money.
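The margin-of-error argument is plain arithmetic on figures already reported here, and can be checked as follows. Treating the RMSE or MAE as a flat discount on the prediction is a simplification used for illustration, not a formal prediction interval.

```python
predicted = 10_723_726  # model's predicted salary for Trae Young
actual = 6_571_800      # his actual 2020-2021 salary
rmse = 2_535_178        # model RMSE reported earlier
mae = 1_881_484         # model MAE reported earlier

# Pessimistic case: assume the model over-predicted by a full "average" error.
adjusted_by_rmse = predicted - rmse
adjusted_by_mae = predicted - mae

gap_after_rmse = adjusted_by_rmse - actual
gap_after_mae = adjusted_by_mae - actual

print(f"Adjusted by RMSE: ${adjusted_by_rmse:,} (still ${gap_after_rmse:,} above actual)")
print(f"Adjusted by MAE:  ${adjusted_by_mae:,} (still ${gap_after_mae:,} above actual)")
```

Under either adjustment the discounted prediction remains above the actual salary, which is why the underpayment finding does not depend on which error metric is used.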

Trae Young plotted in the 3d visualization of Assists, Points, Turnovers, and Salary

Donovan Mitchell

Looking at the 3D visualization, Donovan Mitchell is another player that we would expect to be underpaid, and our generalized linear model confirms this. The model predicted that Donovan Mitchell would earn $8,092,526, but during the 2020-2021 season he was paid only $5,195,501.

Donovan Mitchell plotted in the 3d visualization of Assists, Points, Turnovers, and Salary

Shai Gilgeous-Alexander

Looking at the 3D visualization, Shai Gilgeous-Alexander is another player that we would expect to be underpaid, and our generalized linear model confirms this. The model predicted that Shai Gilgeous-Alexander would earn $7,311,061, but during the 2020-2021 season he was paid only $4,141,320.

Shai Gilgeous-Alexander plotted in the 3d visualization of Assists, Points, Turnovers, and Salary